Wrangling Data 1: dplyr
2024-10-17
|> or %>% for data manipulationWe should also keep to one type of observational unit per table
The 6 main verbs in dplyr:
ALL verbs follow the same format:
Also install/load:
Select variables by name
Select variables by name
| country | continent | year |
|---|---|---|
| Afghanistan | Asia | 1952 |
| Afghanistan | Asia | 1957 |
| Afghanistan | Asia | 1962 |
| Afghanistan | Asia | 1967 |
Select variables by name
| continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|
| Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Asia | 1967 | 34.020 | 11537966 | 836.1971 |
There are lots of helper functions that work with select, e.g.:
Using gapminder
Using starwars - load with data("starwars")
Using flights - load with library(nycflights13)
Using gapminder
Using starwars
select_if())Using flights
return rows with matching conditions
return rows with matching conditions
# combining multiple filters in a single statement
country_names <- c('Afghanistan', 'Angola', 'Belgium', 'China', 'Mauritania',
'Mauritius', 'Mongolia', 'Korea, Rep.', 'Sri Lanka',
'Saudi Arabia', 'Vietnam', 'Yemen, Rep.')
filter(gapminder, year == 1997L, continent != "asia", pop >= 1000000L,
country %in% country_names)| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
| Angola | Africa | 1997 | 40.963 | 9875024 | 2277.1409 |
| Belgium | Europe | 1997 | 77.530 | 10199787 | 27561.1966 |
| China | Asia | 1997 | 70.426 | 1230075000 | 2289.2341 |
combining filters
combining filters
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 |
| Albania | Europe | 1967 | 66.220 | 1984060 | 2760.1969 |
| Albania | Europe | 1997 | 72.950 | 3428038 | 3193.0546 |
Using gapminder
Using starwars
filter() and grepl()Using gapminder
Using starwars
arrange rows by variables
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Japan | Asia | 2007 | 82.603 | 127467972 | 31656.07 |
| Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.98 |
| Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59 |
| Iceland | Europe | 2007 | 81.757 | 301931 | 36180.79 |
note: order is important when using multiple variables!
Using gapminder
add new variables
Using starwars
Using flights
ifelse())Using starwars
str_to_title() from stringr)Using flights
na_if()Let’s combine lots of operations together to get the data we want!
It’s as simple as
…right???!
WRONG!! (ish)
Instead of this:
Do this:
Pipes:
|> (no need to load tidyverse)|> here, you could use %>% instead (but load dplyr first)%>% in a lot of code in the wild, so it’s good to know both%>% yourself if you’re forced to use an older version of RUsing starwars
tip: use is.character and ifelse() to create the new column following the format:
Using flights
Useful only in conjunction with other methods
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Useful only in conjunction with other methods
reduces multiple values down to a single value
| mean_lifexp |
|---|
| 59.47444 |
Combining group_by() and summarise() gives us so much more power!
gapminder |>
select(-pop) |>
filter(continent == 'Oceania', year > 1980L) |>
arrange(desc(gdpPercap)) |>
group_by(country) |>
summarise(n_years = n(),
Mean_Life_Exp = mean(lifeExp),
SD_Life_Exp = sd(lifeExp),
Max_GDP = max(gdpPercap))| country | n_years | Mean_Life_Exp | SD_Life_Exp | Max_GDP |
|---|---|---|---|---|
| Australia | 6 | 78.17583 | 2.458826 | 34435.37 |
| New Zealand | 6 | 76.89233 | 2.552158 | 25185.01 |
We can use lots of built-in functions when using summarise:
Note: Don’t forget to include na.rm = TRUE if NAs are present
Using gapminder
Using airquality
Using airquality
summarise_if() and is.integer()Using starwars
print(n = x), where x is a suitably high number of rowsdplyr::glimpse() instead of str()everything() to re-order columnsn_distinct() instead of length(unique())dplyr::rename(new_name = old_name) for renaming columnsdplyr::select_() and other variantsdplyr::near() and dplyr::between()coalesce(), recode() and case_when() from dplyrtibble::rownames_to_column()dplyr::select()grepl() is better done using stringr’s str_detect() (introduced tomorrow)stringr (tomorrow) combined with dplyr (more tomorrow) and the native pipe (or maggritr pipe if needed) will cover a lot of your everyday needsWhat if:
Suggested Reading
browseVignettes(package = "dplyr")